Authors
Affiliation

Alessandro Pizzi

University of Lausanne

Andrea Lovato

Ayman El Abed

Illia Dorofieiev

Published

December 4, 2024

Abstract

This is the abstract of the report. It should be a short summary of the project, the data, the analysis and the results. It should be concise and to the point. It should not be longer than 250 words.


1 Introduction

1.1 Project Goals

Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.

This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.

While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.

1.2 Research Questions

  • Question 1

    What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?

  • Question 2

    Can we predict whether a person will be obese based on some given combinations of factors?

  • Question 3

    How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?

2 Data

2.1 Sources

The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.
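The interpolation idea behind SMOTE can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the Weka implementation used to build the dataset: the toy matrix `minority`, the helper `smote_sketch`, and the choice of k = 2 neighbours are all hypothetical.

```r
# Minimal SMOTE-style interpolation (illustrative only; not the Weka implementation).
set.seed(42)

# Hypothetical minority-class rows (values invented for the example)
minority <- cbind(height = c(1.60, 1.65, 1.70, 1.72),
                  weight = c(48, 50, 52, 55))

smote_sketch <- function(x, n_new, k = 2) {
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                  dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x), 1)              # pick one minority sample
    nn   <- order(d[base, ])[seq_len(k)]    # its k nearest minority neighbours
    nb   <- nn[sample(length(nn), 1)]       # choose one neighbour at random
    u    <- runif(1)                        # interpolation weight in [0, 1]
    synth[i, ] <- x[base, ] + u * (x[nb, ] - x[base, ])
  }
  synth
}

synthetic <- smote_sketch(minority, n_new = 6)
synthetic
```

Each synthetic row lies on the segment between a real sample and one of its neighbours, which is why synthetic records tend to show less variability than genuinely collected ones.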

2.2 Description

The dataset consists of 2111 records and 17 attributes, offering a detailed examination of the factors contributing to obesity. The attributes represent a mix of categorical and continuous variables, providing insights into demographic, lifestyle, and behavioral factors.

The variables include:

  • Gender (Categorical): indicates the gender of the individual (Male/Female).

  • Age (Continuous): represents the age of participants in years.

  • Height (Continuous): the height of individuals in meters.

  • Weight (Continuous): the weight of participants in kilograms.

  • Family History of Overweight (Categorical): indicates whether a family member has suffered from overweight (Yes/No).

  • Frequent Consumption of High-Caloric Food (FAVC) (Categorical): indicates if participants frequently consume high-caloric foods (Yes/No).

  • Frequency of Vegetable Consumption (FCVC) (Continuous): scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).

  • Number of Main Meals per Day (NCP) (Continuous): indicates the typical number of main meals consumed daily.

  • Consumption of Food Between Meals (CAEC) (Categorical): describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).

  • Smoking (SMOKE) (Categorical): indicates whether participants smoke (Yes/No).

  • Daily Water Consumption (CH2O) (Continuous): scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).

  • Calorie Monitoring (SCC) (Categorical): whether participants monitor their calorie intake (Yes/No).

  • Physical Activity Frequency (FAF) (Continuous): scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).

  • Time Using Technology Devices (TUE) (Continuous): reflects daily time spent on technological devices, in hours.

  • Alcohol Consumption (CALC) (Categorical): indicates the frequency of alcohol consumption (e.g., I don’t drink, Sometimes, Frequently, Always).

  • Transportation Method (MTRANS) (Categorical): describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).

  • Obesity Level (NObeyesdad) (Categorical): the target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).

The dataset has been pre-processed, with normalization applied to continuous variables and categorical data encoded. SMOTE was used to address class imbalance, but care was taken to minimize artificial patterns. The combination of synthetic data (77%), which provides class balance, and real-world data (23%), which introduces authenticity, allows for a comprehensive analysis of obesity-related factors, provided that potential biases such as self-report inaccuracies are kept in mind.

2.3 Wrangling

Import dataset.

Code
library(here)
library(knitr)
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
kable(head(dataset_raw), format = "markdown", caption = "First 6 Rows of dataset_raw")
First 6 Rows of dataset_raw
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 no Public_Transportation Normal_Weight
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight

Load required libraries for data manipulation, visualization, and clustering. Each package serves a specific purpose:

  • dplyr: For data manipulation (e.g., filtering, summarizing).
  • tidyr: For data tidying (e.g., reshaping).
  • ggplot2: For visualization.
  • corrplot: For correlation matrix visualization.
  • ggridges: For creating ridge plots.
  • cluster: For clustering algorithms.
  • reshape2: For data reshaping, especially during visualization.
Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(ggridges)
library(cluster)
library(reshape2)

We rename columns for clarity and ease of use in the analysis. The new names are shorter and more intuitive while preserving their original meaning.

Code
dataset <- dataset_raw %>%
  rename(
    family_hist = family_history_with_overweight,
    obesity_lev = NObeyesdad,
    caloric_food = FAVC,
    vegetable_food = FCVC,
    nb_meal_day = NCP,
    food_btw_meals = CAEC,
    ch2o = CH2O,
    smoke = SMOKE,
    calorie_check = SCC,
    physical_act = FAF,
    freq_alcohol = CALC,
    use_tech = TUE,
    m_trans = MTRANS,
    gender = Gender,
    age = Age,
    weight = Weight,
    height = Height
  )

Check for missing values in the dataset by counting the NA values in each column.

Code
missing_values <- colSums(is.na(dataset))
kable(missing_values, format = "markdown", caption = "Missing Values in Each Column")
Missing Values in Each Column
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

All columns contain complete data, with no missing values. If missing data were present, we could address it either by removing the affected rows with dataset <- na.omit(dataset) or by imputing the missing values with an appropriate measure (e.g., the mean or median).
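Both strategies for handling missing data can be sketched as follows. The data frame `toy` is hypothetical and only illustrates the mechanics, since the actual dataset contains no missing values.

```r
# Hypothetical data frame with a few missing values
toy <- data.frame(age    = c(21, NA, 23, 27),
                  weight = c(64, 56, NA, 87))

# Option 1: drop every row that contains at least one NA
toy_complete <- na.omit(toy)

# Option 2: replace NAs in numeric columns with the column median
toy_imputed <- toy
for (col in names(toy_imputed)) {
  if (is.numeric(toy_imputed[[col]])) {
    med <- median(toy_imputed[[col]], na.rm = TRUE)
    toy_imputed[[col]][is.na(toy_imputed[[col]])] <- med
  }
}

nrow(toy_complete)           # rows that survive na.omit()
colSums(is.na(toy_imputed))  # no NAs remain after imputation
```

Dropping rows is simpler but discards information, whereas median imputation keeps every record at the cost of slightly reduced variance.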

Check the structure of the dataset to identify data types for each variable. This helps in identifying columns that need to be converted or standardized.

Code
# Capture the structure of the dataset
str_output <- capture.output(str(dataset))
# Convert the structure output to a data frame
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

kable(str_table, format = "markdown", caption = "Structure of the Dataset")
Structure of the Dataset
Structure
‘data.frame’: 2111 obs. of 17 variables:
$ gender : chr “Female” “Female” “Male” “Male” …
$ age : num 21 21 23 27 22 29 23 22 24 22 …
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 …
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 …
$ family_hist : chr “yes” “yes” “yes” “no” …
$ caloric_food : chr “no” “no” “no” “no” …
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 …
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 …
$ food_btw_meals: chr “Sometimes” “Sometimes” “Sometimes” “Sometimes” …
$ smoke : chr “no” “yes” “no” “no” …
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 …
$ calorie_check : chr “no” “yes” “no” “no” …
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 …
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 …
$ freq_alcohol : chr “no” “Sometimes” “Frequently” “Frequently” …
$ m_trans : chr “Public_Transportation” “Public_Transportation” “Public_Transportation” “Walking” …
$ obesity_lev : chr “Normal_Weight” “Normal_Weight” “Normal_Weight” “Overweight_Level_I” …

We convert specific columns to factors for categorical interpretation during analysis. Factors ensure proper handling of discrete variables in statistical modeling.

We arranged the levels of the obesity categories, food consumption between meals, and the frequency of alcohol use to follow a logical ordinal progression, ensuring these variables accurately reflect increasing severity or frequency for improved interpretability and analysis.

Code
dataset <- dataset %>%
  mutate(
    gender = as.factor(gender),
    family_hist = as.factor(family_hist),
    caloric_food = as.factor(caloric_food),
    smoke = as.factor(smoke),
    calorie_check = as.factor(calorie_check),
    m_trans = as.factor(m_trans),
    obesity_lev = factor(obesity_lev, 
                         levels = c("Insufficient_Weight", "Normal_Weight", 
                                    "Overweight_Level_I", "Overweight_Level_II", 
                                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"), 
                         ordered = TRUE),
    food_btw_meals = factor(ifelse(food_btw_meals == "no", "No", food_btw_meals), 
                            levels = c("No", "Sometimes", "Frequently", "Always"), 
                            ordered = TRUE),
    freq_alcohol = factor(ifelse(freq_alcohol == "no", "No", freq_alcohol), 
                          levels = c("No", "Sometimes", "Frequently", "Always"), 
                          ordered = TRUE))

Using str() before and after confirms that each variable has the correct data type, preventing errors during analysis.

Code
str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

kable(str_table, format = "markdown", caption = "Structure of the Dataset")
Structure of the Dataset
Structure
‘data.frame’: 2111 obs. of 17 variables:
$ gender : Factor w/ 2 levels “Female”,“Male”: 1 1 2 2 2 2 1 2 2 2 …
$ age : num 21 21 23 27 22 29 23 22 24 22 …
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 …
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 …
$ family_hist : Factor w/ 2 levels “no”,“yes”: 2 2 2 1 1 1 2 1 2 2 …
$ caloric_food : Factor w/ 2 levels “no”,“yes”: 1 1 1 1 1 2 2 1 2 2 …
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 …
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 …
$ food_btw_meals: Ord.factor w/ 4 levels “No”<“Sometimes”<..: 2 2 2 2 2 2 2 2 2 2 …
$ smoke : Factor w/ 2 levels “no”,“yes”: 1 2 1 1 1 1 1 1 1 1 …
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 …
$ calorie_check : Factor w/ 2 levels “no”,“yes”: 1 2 1 1 1 1 1 1 1 1 …
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 …
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 …
$ freq_alcohol : Ord.factor w/ 4 levels “No”<“Sometimes”<..: 1 2 3 3 2 2 2 2 3 1 …
$ m_trans : Factor w/ 5 levels “Automobile”,“Bike”,..: 4 4 4 5 4 1 3 4 4 4 …
$ obesity_lev : Ord.factor w/ 7 levels “Insufficient_Weight”<..: 2 2 2 3 4 2 2 2 2 2 …
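As a small illustration of what the ordered factors shown above provide, the sketch below uses a hypothetical vector `x` with the same four frequency levels. It is only a toy example, not part of the original pipeline.

```r
# Frequency levels in increasing order, as in the conversion step
lev <- c("No", "Sometimes", "Frequently", "Always")

# Hypothetical responses
x <- factor(c("Sometimes", "No", "Always", "Sometimes"),
            levels = lev, ordered = TRUE)

x >= "Sometimes"   # comparisons respect the declared order
as.integer(x)      # integer codes follow the level order (No = 1, ..., Always = 4)
max(x)             # the highest frequency observed

# Without explicit levels, R orders the labels alphabetically instead
levels(factor(c("Sometimes", "No", "Always", "Sometimes")))
```

Declaring the levels explicitly avoids the alphabetical default, which would otherwise place "Always" before "No".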

Check for duplicated rows in the dataset.

Code
duplicated_rows <- sum(duplicated(dataset))
duplicated_rows
[1] 24

Keep only one instance of each duplicated row.

Code
dataset <- dataset %>%
  distinct()

Check the number of rows after removing duplicates.

Code
nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE
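The row counts above are consistent: duplicated() flags only the second and later copies of a row, so removing the 24 flagged rows from 2111 leaves 2087. A minimal sketch on a hypothetical data frame `toy` shows the same bookkeeping in base R.

```r
# Hypothetical data frame in which rows 2 and 4 repeat earlier rows
toy <- data.frame(a = c(1, 1, 2, 2, 2),
                  b = c("x", "x", "y", "y", "z"))

n_dup      <- sum(duplicated(toy))  # counts only the repeated copies
toy_unique <- unique(toy)           # keeps the first instance of each row

n_dup
nrow(toy_unique) == nrow(toy) - n_dup
```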

In-depth analysis of SMOTE’s impact and visualization of the class distribution

Code
ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) #Adjusted the text for clarity

After applying SMOTE, the distribution is noticeably more balanced across all categories, with each class showing a similar count. This outcome reflects SMOTE’s intended effect of addressing class imbalance.
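The visual impression of balance can also be quantified. The sketch below uses a hypothetical label vector `obesity_lev_toy`; applied to the real data, the same calls would take dataset$obesity_lev as input.

```r
# Hypothetical class labels (counts invented for the example)
obesity_lev_toy <- factor(c(rep("Normal_Weight", 5),
                            rep("Obesity_Type_I", 6),
                            rep("Overweight_Level_I", 4)))

counts <- table(obesity_lev_toy)        # absolute class counts
shares <- prop.table(counts)            # class proportions, summing to 1
imbalance_ratio <- max(counts) / min(counts)

round(shares, 2)
imbalance_ratio   # a value of 1 would mean perfectly balanced classes
```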

Distribution analysis

Density plot for age.

Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level") +
  xlim(14, 50) # Limit the x-axis to ages 14–50

This graph allows us to assess the age distribution across obesity levels and to evaluate the impact of the SMOTE algorithm in generating synthetic data. Two key takeaways emerge. First, the distributions show a clear separation between obesity categories, with younger ages dominating in the lower levels (e.g., Insufficient Weight and Normal Weight) and older ages appearing more prominently in several higher levels (e.g., Overweight Level II and Obesity Type II), although Obesity Type III is again concentrated at younger ages. Second, sharp peaks, such as the one around age 30 in “Obesity Type I,” could signal potential artifacts from data synthesis. While these patterns indicate that the dataset maintains logical trends, further validation is necessary to confirm that these separations and peaks reflect realistic population characteristics and not artificial biases introduced during data augmentation. Overall, the dataset appears well-structured, but these observations warrant careful consideration during analysis.

Summary statistics by obesity level.

Code
dataset_stat <- dataset %>%
  group_by(obesity_lev) %>%
  summarize(
    Age_Mean = mean(age, na.rm = TRUE),
    Age_SD = sd(age, na.rm = TRUE),
    Height_Mean = mean(height, na.rm = TRUE),
    Height_SD = sd(height, na.rm = TRUE),
    Weight_Mean = mean(weight, na.rm = TRUE),
    Weight_SD = sd(weight, na.rm = TRUE)
  )
kable(dataset_stat,format = "markdown",caption = "Summary statistics by obesity level",digits = 1)
Summary statistics by obesity level
obesity_lev Age_Mean Age_SD Height_Mean Height_SD Weight_Mean Weight_SD
Insufficient_Weight 19.8 2.7 1.7 0.1 50.0 6.0
Normal_Weight 21.8 5.1 1.7 0.1 62.2 9.3
Overweight_Level_I 23.5 6.3 1.7 0.1 74.5 8.6
Overweight_Level_II 27.0 8.1 1.7 0.1 82.1 8.5
Obesity_Type_I 25.9 7.8 1.7 0.1 92.9 11.5
Obesity_Type_II 28.2 4.9 1.8 0.1 115.3 8.0
Obesity_Type_III 23.5 2.8 1.7 0.1 120.9 15.5

The summary statistics show no implausible values. Mean height is nearly constant across obesity levels (1.7–1.8 m, SD 0.1), mean age ranges from 19.8 to 28.2 years, and mean weight rises steadily from 50.0 kg (Insufficient Weight) to 120.9 kg (Obesity Type III), as expected for a weight-based classification. Within-class standard deviations remain moderate (2.7 to 8.1 years for age, 6.0 to 15.5 kg for weight). Interpretation: the absence of extreme values and the plausible ordering of the class means are consistent with SMOTE having balanced the classes without distorting key variable distributions, although summary statistics alone cannot confirm this.

Perform K-means clustering and calculate silhouette score.

Code
library(cluster)
set.seed(123)
kmeans_res <- kmeans(select(dataset, where(is.numeric)), centers = length(unique(dataset$obesity_lev)))
silhouette_score <- silhouette(kmeans_res$cluster, dist(select(dataset, where(is.numeric))))
mean_silhouette_score <- mean(silhouette_score[, "sil_width"])
mean_silhouette_score
[1] 0.4513519

Silhouette score from K-means clustering: the mean silhouette score of approximately 0.451 suggests a moderate level of cohesion within clusters and some separation between them. Note that the seven clusters are found by K-means on the numeric variables and need not coincide with the seven obesity levels, so the score describes how well the numeric data separate into seven groups. Interpretation: a score close to 0.5 generally reflects reasonable separability without excessive artificial separation. This is consistent with SMOTE having produced distinguishable but not overly isolated groups, which is desirable for class balance. We conclude that SMOTE has balanced the dataset without drastically distorting it.
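Because K-means relies on Euclidean distance, variables measured on larger scales (such as weight in kilograms) can dominate the clustering. The sketch below shows one possible variant that standardizes the variables first. It runs on simulated data (`toy` is hypothetical) and is offered as an alternative to consider, not as the procedure used above.

```r
library(cluster)  # provides silhouette()
set.seed(123)

# Simulated data with two loosely separated groups (values invented)
toy <- data.frame(
  age    = c(rnorm(30, 20, 2),  rnorm(30, 35, 3)),
  weight = c(rnorm(30, 55, 5),  rnorm(30, 110, 8)),
  height = rnorm(60, 1.70, 0.09)
)

# Standardize each column to mean 0 and SD 1 so no variable dominates
toy_scaled <- scale(toy)

# K-means with several random starts to reduce sensitivity to initialization
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Silhouette widths computed on the same scaled distances
sil <- silhouette(km$cluster, dist(toy_scaled))
mean_sil <- mean(sil[, "sil_width"])
mean_sil
```

Comparing the scaled and unscaled scores on the real data would indicate how much of the separation is driven by weight alone.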

Creating a Numerical Dataset “dataset_num”.

Code
# Build a numeric copy of the dataset. Every step after the first continues
# from dataset_num, so that earlier recodes are not overwritten.
dataset_num <- dataset %>%
  mutate(obesity_lev = recode(obesity_lev,
                              "Insufficient_Weight" = 1,
                              "Normal_Weight" = 2,
                              "Overweight_Level_I" = 3,
                              "Overweight_Level_II" = 4,
                              "Obesity_Type_I" = 5,
                              "Obesity_Type_II" = 6,
                              "Obesity_Type_III" = 7))

dataset_num <- dataset_num %>%
  mutate(freq_alcohol = recode(freq_alcohol,
                               "No" = 1,
                               "Sometimes" = 2,
                               "Frequently" = 3,
                               "Always" = 4))

dataset_num <- dataset_num %>%
  mutate(m_trans = recode(m_trans,
                          "Automobile" = 1,
                          "Bike" = 2,
                          "Motorbike" = 3,
                          "Public_Transportation" = 4,
                          "Walking" = 5))

dataset_num <- dataset_num %>%
  mutate(food_btw_meals = recode(food_btw_meals,
                                 "No" = 0,
                                 "Sometimes" = 1,
                                 "Frequently" = 2,
                                 "Always" = 3))

dataset_num <- dataset_num %>%
  mutate(calorie_check = recode(calorie_check,
                                "no" = 0,
                                "yes" = 1))

# Convert any remaining factors to their integer level codes
dataset_num <- dataset_num %>%
  mutate(across(where(is.factor), ~ as.numeric(.)))

str(dataset_num)
'data.frame':   2087 obs. of  17 variables:
 $ gender        : num  1 1 2 2 2 2 1 2 2 2 ...
 $ age           : num  21 21 23 27 22 29 23 22 24 22 ...
 $ height        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ weight        : num  64 56 77 87 89.8 53 55 53 64 68 ...
 $ family_hist   : num  2 2 2 1 1 1 2 1 2 2 ...
 $ caloric_food  : num  1 1 1 1 1 2 2 1 2 2 ...
 $ vegetable_food: num  2 3 2 3 2 2 3 2 3 2 ...
 $ nb_meal_day   : num  3 3 3 3 1 3 3 3 3 3 ...
 $ food_btw_meals: num  1 1 1 1 1 1 1 1 1 1 ...
 $ smoke         : num  1 2 1 1 1 1 1 1 1 1 ...
 $ ch2o          : num  2 3 2 2 2 2 2 2 2 2 ...
 $ calorie_check : num  0 1 0 0 0 0 0 0 0 0 ...
 $ physical_act  : num  0 3 2 2 0 0 1 3 1 1 ...
 $ use_tech      : num  1 0 1 0 0 0 0 0 1 1 ...
 $ freq_alcohol  : num  1 2 3 3 2 2 2 2 3 1 ...
 $ m_trans       : num  4 4 4 5 4 1 3 4 4 4 ...
 $ obesity_lev   : num  2 2 2 3 4 2 2 2 2 2 ...
Code
head(dataset_num)
  gender age height weight family_hist caloric_food vegetable_food nb_meal_day
1      1  21   1.62   64.0           2            1              2           3
2      1  21   1.52   56.0           2            1              3           3
3      2  23   1.80   77.0           2            1              2           3
4      2  27   1.80   87.0           1            1              3           3
5      2  22   1.78   89.8           1            1              2           1
6      2  29   1.62   53.0           1            2              2           3
  food_btw_meals smoke ch2o calorie_check physical_act use_tech freq_alcohol
1              1     1    2             0            0        1            1
2              1     2    3             1            3        0            2
3              1     1    2             0            2        1            3
4              1     1    2             0            2        0            3
5              1     1    2             0            0        0            2
6              1     1    2             0            0        0            2
  m_trans obesity_lev
1       4           2
2       4           2
3       4           2
4       5           3
5       4           4
6       1           2

2.3.1 Spotting Mistakes and Missing Data

We verified the presence of any potential NA values that might have arisen during the conversion of categorical variables to numeric format.

Code
colSums(is.na(dataset_num))
        gender            age         height         weight    family_hist 
             0              0              0              0              0 
  caloric_food vegetable_food    nb_meal_day food_btw_meals          smoke 
             0              0              0              0              0 
          ch2o  calorie_check   physical_act       use_tech   freq_alcohol 
             0              0              0              0              0 
       m_trans    obesity_lev 
             0              0 

The results of the test confirmed that there are no NA values in the dataset, indicating that all variables were successfully converted to numeric format while retaining their integrity.
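To illustrate why this check is worthwhile, the sketch below shows how a label that does not match the expected spelling silently becomes NA during a numeric conversion. The lookup vector `codes` and the vector `raw_labels` are hypothetical, and the lookup approach is only one of several ways to encode categories.

```r
# Hypothetical lookup table from category label to numeric code
codes <- c(No = 0, Sometimes = 1, Frequently = 2, Always = 3)

# Raw labels in which "no" is lower-case, as in the raw dataset
raw_labels <- c("Sometimes", "no", "Always")

# Unmatched labels return NA instead of raising an error
converted <- unname(codes[raw_labels])
sum(is.na(converted))   # number of labels lost in the conversion

# Harmonizing the spelling first avoids the problem
fixed <- unname(codes[ifelse(raw_labels == "no", "No", raw_labels)])
fixed
```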

2.3.2 Listing Anomalies and Outliers

2.4 Correlation Analysis

The goal of this analysis is to select the factors that may influence obesity level.

We computed a correlation matrix to analyze the relationships between numeric variables, focusing on their associations with obesity_lev. Variables were reordered by the strength of their correlation with obesity_lev for clarity. A heatmap was generated using a diverging color gradient to visualize these correlations, with red indicating strong positive relationships, blue for negative, and white for weak or neutral. Numerical labels and rotated axis labels were added to improve interpretability, highlighting key factors linked to obesity levels.

Code
#Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>%
                    select("physical_act", "freq_alcohol", "obesity_lev", "age",
                           "weight","height", "family_hist", "caloric_food",
                           "vegetable_food", "food_btw_meals", "use_tech", "ch2o",
                           "m_trans", "smoke","nb_meal_day", "calorie_check",
                           "gender"),
                  use = "complete.obs")

#Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

#Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

#Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

#Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        # Rotate x-axis labels for readability
        axis.text.y = element_text(angle = 45, vjust = 1) 
        # Rotate y-axis labels for readability
  )

Code
# Create the heatmap with correlation values

# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>%
                    select("physical_act", "freq_alcohol", "obesity_lev", "age",
                           "weight", "family_hist", "caloric_food",
                           "vegetable_food", "food_btw_meals",
                           "use_tech","ch2o", "height",
                           "calorie_check", "gender"),
                  use = "complete.obs")

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)


ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        # Rotate x-axis labels for readability
        axis.text.y = element_text(angle = 45, vjust = 1) 
        # Rotate y-axis labels for readability
  )
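Since several of the variables are ordinal codes rather than true measurements, a rank-based correlation is a possible complement to the default Pearson coefficient used by cor(). The sketch below is an assumption-laden illustration on a hypothetical data frame `toy`; it reuses the same ordering logic as the heatmap code but is not the authors' analysis.

```r
# Hypothetical numeric data (values invented for the example)
toy <- data.frame(
  obesity_lev  = c(1, 2, 3, 4, 5, 6, 7),
  weight       = c(50, 62, 75, 82, 93, 115, 121),
  physical_act = c(3, 2, 2, 1, 1, 0, 0)
)

pearson  <- cor(toy, method = "pearson")    # linear association
spearman <- cor(toy, method = "spearman")   # monotonic (rank-based) association

# Order variables by their Spearman correlation with obesity_lev
target       <- spearman["obesity_lev", ]
ordered_vars <- names(sort(target, decreasing = TRUE))
round(spearman[ordered_vars, ordered_vars], 2)
```

Spearman's coefficient depends only on ranks, so it is unaffected by the arbitrary spacing of codes such as 1 to 7 for the obesity levels.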

3 Exploratory Data Analysis (EDA)

3.1 Descriptive statistics and distribution analysis

3.1.1 Age

Descriptive statistics for age.

Code
summary(dataset$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.00   19.92   22.85   24.35   26.00   61.00 
Code
sd(dataset$age, na.rm = TRUE)
[1] 6.368801

Age distribution

The age data shows a right-skewed distribution, with a mean of 24.35 years and a median of 22.85 years. The range (14 to 61 years) covers a wide age span, but most individuals are concentrated in the 20–30 age range (the interquartile range runs from 19.92 to 26.00 years). The standard deviation (6.37 years) suggests moderate variability in the dataset. This young population distribution may limit the applicability of results to older age groups, where obesity risk factors could differ.
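The right skew described above can be checked numerically: in a right-skewed sample the mean exceeds the median and the moment-based skewness coefficient is positive. The sketch below uses a hypothetical vector `age_toy`; on the real data, dataset$age would take its place.

```r
# Hypothetical ages with a long right tail
age_toy <- c(18, 19, 20, 21, 21, 22, 23, 24, 26, 30, 41, 55)

m   <- mean(age_toy)
med <- median(age_toy)

# Moment-based skewness: third central moment divided by SD cubed
skew <- mean((age_toy - m)^3) / (mean((age_toy - m)^2))^(3 / 2)

c(mean = m, median = med, skewness = skew)
```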

Age Distribution by Obesity Level (Violin Plot)

The age distribution varies across obesity levels, highlighting distinct trends. Insufficient and normal weight categories are concentrated among younger individuals (14–30), while overweight and obesity levels shift towards mid-adulthood (20–40), peaking around 30–35 years. Obesity Type III is an exception: according to the summary statistics above, it is concentrated in a narrow band in the early-to-mid twenties (mean age 23.5 years, SD 2.8). These patterns suggest a progression of weight issues with age and emphasize the need for targeted interventions during early to mid-adulthood to prevent worsening obesity levels.

Code
ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1))

The violin plot shows more clearly how individuals in the lower obesity categories, such as insufficient and normal weight, are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels exhibit a broader age range, with a peak density observed around 30–40 years, particularly in Obesity Type I and Type II. Obesity Type III, by contrast, has a narrow distribution centered in the early-to-mid twenties (mean age 23.5 years in the summary statistics above). This visualization suggests a gradual progression of obesity risk with age and emphasizes the need for early intervention strategies to address weight-related health issues, particularly during early and mid-adulthood when such risks become more pronounced.

Trend of obesity level with age (LOESS smoothed trend line).

Code
ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()

The graph shows a smooth trend line capturing the overall pattern, with the ordered obesity levels treated as the numeric codes 1 to 7. Obesity levels increase markedly from adolescence to early adulthood, peaking around the 25–30 age range. This period potentially represents a critical transition, where lifestyle factors such as reduced physical activity, higher caloric intake, and metabolic changes can contribute to the steep rise in obesity levels.

Beyond the peak, the trend shows a gradual decline in obesity levels after 30 years, which may reflect behavioral changes, such as increased health awareness or dietary improvements, or a selection bias in older age groups. This reversal suggests that the mid-20s to early 30s are a pivotal stage for interventions aimed at mitigating obesity risk.
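The location of such a peak can also be read off the fitted curve directly. The sketch below fits a LOESS curve with base R's loess() on simulated data and extracts the age at which the fitted value is highest. The variables `age_sim` and `level_sim` are hypothetical and were constructed to peak near age 30; this illustrates the technique and does not reproduce the result above.

```r
set.seed(1)

# Simulated data: the level rises until about age 30, then declines (invented)
age_sim   <- seq(14, 61, length.out = 200)
level_sim <- 5 - (age_sim - 30)^2 / 200 + rnorm(200, sd = 0.3)

# Local regression, the same family of smoother as geom_smooth(method = "loess")
fit <- loess(level_sim ~ age_sim, span = 0.75)

# Evaluate the fitted curve on a regular grid within the observed age range
grid <- data.frame(age_sim = seq(14, 61, by = 0.5))
grid$fitted <- predict(fit, newdata = grid)

# Age at which the smoothed curve reaches its maximum
peak_age <- grid$age_sim[which.max(grid$fitted)]
peak_age
```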

3.1.2 Height

Descriptive statistics for height.

Code
summary(dataset$height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.450   1.630   1.702   1.703   1.769   1.980 
Code
sd(dataset$height, na.rm = TRUE)
[1] 0.09318594

Height distribution.

Code
ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()

The histogram shows that height (in meters) is approximately normally distributed, with a slight right skew. Values range from 1.45 m to 1.98 m, with a peak around 1.8 m, indicating that it is the most frequent height. The range is realistic, with no visible extreme outliers, and the standard deviation (0.09 m) indicates low variability. The mean and median are both about 1.70 m, confirming a nearly symmetrical distribution.

Height by Obesity Level

Violin plot of height by obesity level.

Code
ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows relatively low variability in height within each category, with overlapping ranges between most groups. Individuals with Insufficient Weight and Normal Weight have slightly narrower distributions, centered around similar heights (~1.7 m). As obesity levels increase (e.g., Obesity Type I–III), the distributions remain consistent, suggesting that height is not strongly associated with obesity classification. Weight is therefore likely to be more influential than height alone in determining obesity level.

3.0.0.1.3 Weight

Descriptive statistics for weight.

Code
summary(dataset$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  39.00   66.00   83.10   86.86  108.02  173.00 
Code
sd(dataset$weight, na.rm = TRUE)
[1] 26.19085

Weight by gender

Density plot for weight distribution by gender.

Code
ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()

The density plot reveals distinct weight distributions between genders. Females generally weigh less, with a peak around 70 kg, while males peak around 85 kg and 115 kg, indicating a tendency toward higher weights. The overlapping region around 80–90 kg shows weights common to both genders, but the distinct density peaks emphasize gender-based differences in weight distribution. Overall, males dominate at higher ranges. Weight ranges from 39 kg to 173 kg, with a mean of 86.9 kg, a median of 83.1 kg, and a standard deviation of 26.2 kg, indicating moderate spread.

Weight by obesity level

Ridgeline Plot of Weight by Obesity Level.

Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")

This ridgeline plot shows a clear progression in weight distribution across obesity levels. As the obesity level increases, the weight distribution shifts progressively to higher ranges. The “Normal Weight” and “Insufficient Weight” categories are concentrated at lower weights, while the higher obesity types (I, II, and III) peak at significantly greater weights, indicating a strong positive association between weight and obesity level. Overall, weight has a mean of 86.9 kg and a standard deviation of 26.2 kg.

3.0.0.1.4 Height and Weight

Scatter Plot (height vs weight), colored by obesity level.

Code
ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +  # Adds a trend line for each obesity level
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")

Facet Grid for Height and Weight by Obesity Level.

Code
ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")

The scatter plot with trend lines for each obesity level reveals a clear positive correlation between weight and height across all obesity levels. As the obesity level increases, the slope generally becomes steeper, indicating a larger increase in weight per unit of height. We created the facet grid to show these trends more clearly. The “Obesity_Type_III” (yellow) category has the steepest slope, suggesting a significant weight increase per unit of height, which is consistent with the highest level of obesity.
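
One possible reading of the steeper slopes can be sketched numerically. It assumes that the obesity levels are derived from body mass index (BMI = weight / height²), which this section does not state explicitly. At a fixed BMI, weight equals BMI × height², so a higher BMI implies more additional weight for the same gain in height. The BMI values 22 and 40 are illustrative choices, not values taken from the dataset.

```r
# BMI in kg/m^2 from weight (kg) and height (m).
bmi <- function(weight, height) weight / height^2

# Weight (kg) implied by a given BMI at a given height (m).
weight_at_bmi <- function(bmi_value, height) bmi_value * height^2

# Extra weight when height goes from 1.60 m to 1.80 m, at two illustrative BMI values.
gain_bmi_22 <- weight_at_bmi(22, 1.80) - weight_at_bmi(22, 1.60)
gain_bmi_40 <- weight_at_bmi(40, 1.80) - weight_at_bmi(40, 1.60)

print(c(bmi_22 = gain_bmi_22, bmi_40 = gain_bmi_40))
```

Under this assumption, the same 20 cm of height corresponds to roughly 15 kg at a BMI of 22 and roughly 27 kg at a BMI of 40, which matches the direction of the pattern in the fitted lines.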

Correlation between height and weight.

Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468

The correlation observed between height and weight (r ≈ 0.457) aligns with existing literature, confirming the expected positive relationship between these variables.

3.0.0.1.5 Food between meals
Code
# Dodged Bar Chart for food_btw_meals by obesity levels
ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
   labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14))

Code
# Stacked Bar Chart of Food Between Meals by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
    geom_bar(position = "fill") + # Stacked bar chart with proportions
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
    ggtitle("Proportion of Food Between Meals Across Obesity Levels") + # Shortened and clear title
    labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") + # Correct axis and legend labels
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis text for readability
        plot.title = element_text(hjust = 0.5, size = 14) # Center and style the title
    )

These charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.

3.0.0.1.6 High-caloric food consumption
Code
# Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels
ggplot(dataset, aes(x = caloric_food, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels") +
   labs(x = "High-Caloric Food Consumption", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

The dodged bar chart clearly shows that the majority of individuals, especially in the higher obesity categories (Obesity Type I–III), report consuming high-caloric foods (“yes”). This trend becomes increasingly pronounced as obesity levels rise, with very few individuals reporting “no” consumption in these categories. In contrast, lower obesity levels (e.g., Normal Weight, Overweight Level I) show a slightly higher representation of “no” responses, indicating a potential shift in dietary habits across obesity levels.

Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  geom_bar(position = "dodge", aes(y = after_stat(count / tapply(count, x, sum)[x])), color = "black") + # Share within each obesity level
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )

The grouped bar chart effectively shows the behavioral shift toward higher high-caloric food consumption as obesity levels increase. High-caloric food consumption (“yes”) consistently accounts for over 75% of responses, becoming nearly universal in higher obesity categories (Obesity Type I–III). In contrast, “no” responses are more visible in lower obesity levels, such as Insufficient Weight and Normal Weight, but remain a minority.

Code
percentage_high_caloric_consumers <- mean(dataset$caloric_food == "yes") * 100
percentage_high_caloric_consumers
[1] 88.35649

More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, which may directly contribute to weight gain, highlighting the need for dietary interventions focused on reducing high-calorie intake.

3.0.0.1.7 Alcohol consumption

Frequency of alcohol consumption.

Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")
ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
   labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

The chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.

Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) + # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and style title
    legend.position = "top"
  )

The proportion of individuals who drink alcohol “Sometimes” increases with higher obesity levels, peaking in Obesity_Type_III. In contrast, the share of individuals who abstain from alcohol (“No”) decreases as obesity levels rise. This pattern suggests that moderate alcohol consumption may be associated with higher obesity levels, while abstention is more common among those with lower obesity levels.

A possible interaction to investigate later is between alcohol frequency and caloric food preference, as both behaviors seem linked to higher obesity levels. Exploring this could reveal whether a preference for caloric foods combined with moderate alcohol consumption has a compounding effect on obesity risk. This investigation could help clarify whether combined lifestyle factors contribute more to higher obesity levels than each factor alone.
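
As a rough sketch of how such an interaction could be screened before modeling, the share of obese individuals can be tabulated for each combination of the two behaviors. The data frame below is invented for illustration, and `obese` is a hypothetical indicator that would have to be derived from `obesity_lev`.

```r
# Toy data standing in for dataset; values are invented for illustration only.
toy <- data.frame(
  freq_alcohol = c("No", "Sometimes", "Sometimes", "No", "Sometimes", "Sometimes"),
  caloric_food = c("no", "yes", "yes", "yes", "yes", "no"),
  obese        = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)  # hypothetical indicator
)

# Share of obese individuals in each alcohol x caloric-food combination.
obese_share <- tapply(toy$obese, list(toy$freq_alcohol, toy$caloric_food), mean)
print(obese_share)
```

A compounding effect would show up as a share in the combined cell that exceeds what the two single-behavior cells suggest. In a regression formula, the same idea corresponds to an interaction term such as `freq_alcohol * caloric_food`.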

Monitoring of daily calorie intake.

Code
# Dodged Bar Chart for calorie_check by Obesity Levels
ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Calorie Checking by Obesity Levels") +
   labs(x = "Calorie Check", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

Code
data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  # Note: groups were dropped above, so each proportion is a share of the whole dataset
  mutate(total = sum(count), proportion = count / total)

ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The second graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.

3.0.0.1.8 Vegetable consumption
Code
ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")

Code
ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", 
         x = "Weight", 
         y = "Vegetable Food Consumption") +
    theme_minimal() +
    coord_cartesian(xlim= c(40, 135), ylim= c(2, 3))

The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.

This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.

The trend reveals two possible key insights:

  • A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
  • The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
3.0.0.1.9 Physical activity

Plot histogram and density.

Code
ggplot(dataset, aes(x = physical_act)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")

The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these are the commonly reported levels. Intermediate values are also present but less frequent; they are most likely a by-product of the synthetic records generated with SMOTE.
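
The sketch below illustrates why SMOTE would produce such intermediate values. SMOTE builds each synthetic record by interpolating between a real observation and one of its nearest neighbours in the same class. The two starting values here are hypothetical, and only a single variable is shown.

```r
set.seed(1)

# Two hypothetical real respondents with integer-coded physical activity.
x_i  <- 1   # a real observation
x_nn <- 2   # one of its nearest neighbours in the same class

# SMOTE places each synthetic value on the segment between the two points.
u <- runif(5)                        # random weights strictly between 0 and 1
synthetic <- x_i + u * (x_nn - x_i)

print(round(synthetic, 3))
```

Every synthetic value falls strictly between 1 and 2, which is consistent with the non-integer activity levels visible between the main peaks.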

Violin plot by category.

Code
ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  theme_minimal() +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.

The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.

3.0.0.1.10 Water consumption

Plot histogram and density for water consumption.

Code
ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Consumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")

This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.

Scatterplot of weight vs. water consumption.

Code
# Scatterplot with a LOESS trend line
ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Water Consumption", x = "Weight", y = "Water Consumption (ch2o)") +
    theme_minimal() +
    coord_cartesian(xlim = c(35, 135))

The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.

This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.

3.0.0.1.11 Technology utilization

Histogram with Density.

Code
ggplot(dataset, aes(x = use_tech)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black", alpha = 0.6) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Histogram and Density of Use of Technology", x = "Use of Technology", y = "Density") +
  theme_minimal()

Density of Use of Technology by Obesity Level.

Code
ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()

This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.

The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.

This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.

3.0.1 4. Analysis overview of statistical methods and model selection

In this analysis, we plan to employ a regression model to examine the relationships between a set of independent variables (the predictors) and Obesity Level, the single dependent variable (the outcome).

The rationale behind the selection of regression modeling stems from its established robustness as a statistical methodology, particularly adept at unraveling and quantifying the interrelations among variables. This is paramount, considering our overarching objective to forecast outcomes and to meticulously evaluate the repercussions that alterations in the predictor variables may have on the target variable.

Based on our exploratory data analysis, indications of potential outliers emerged within our dataset. However, upon closer examination, these values represent extreme data points that remain plausible given the context of our study. Consequently, our approach involves building two regression models: one that includes these extreme values and one that excludes them. The objective is to examine the impact of these extreme data points on the predictive performance of the model, analyzing how their presence or absence influences the resulting predictions and model behavior.
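
The two-model comparison can be sketched as follows. This is an illustration on simulated data, not the study data. The 1.5 × IQR rule used to flag extreme values is one common convention and an assumption here, since the text does not specify how extreme values will be identified.

```r
set.seed(42)

# Simulated data for illustration; not the study data.
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 0.5 * toy$x + rnorm(n, sd = 0.3)
toy$y[1:3] <- toy$y[1:3] + 5   # inject three extreme but finite values

# Flag extreme values with the 1.5 * IQR rule (assumed convention).
q <- unname(quantile(toy$y, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_extreme <- toy$y < q[1] - fence | toy$y > q[2] + fence

# Fit the same model with and without the flagged observations.
fit_all     <- lm(y ~ x, data = toy)
fit_trimmed <- lm(y ~ x, data = toy[!is_extreme, ])

# Compare coefficients and explained variance between the two models.
print(rbind(all = coef(fit_all), trimmed = coef(fit_trimmed)))
print(c(all = summary(fit_all)$r.squared,
        trimmed = summary(fit_trimmed)$r.squared))
```

Large differences between the two rows would indicate that the extreme values drive the estimates, while similar rows would suggest the model is robust to them.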

The rationale for adopting regression analysis is twofold. First, it affords a nuanced understanding of the extent to which each predictor influences the outcome. Second, it provides a suite of statistical metrics for evaluating how much of the variance in the data the model explains. Through regression analysis, we can test for statistically significant associations between the variables and quantify their magnitude and direction. The method yields coefficients that reflect the expected change in the dependent variable for a one-unit change in a predictor, holding the other variables constant. This addresses the first part of our research question. In addition, the regression model will let us determine how much of the variability in the dependent variable our independent variables account for (assess predictive power), and it will let us estimate the individual effect size of each predictor and test its statistical significance (quantify effects).

Finally, by integrating pertinent covariates and control variables into the model, we aim to attenuate biases and isolate the influence of the primary predictors on the outcome, thereby enhancing the accuracy of our findings. By looking at R², the p-values, and the standardized coefficients, we should be able to understand which key factors influence a person's weight condition (obesity level).

To ensure the model performs well, we will check the linearity of the relationships, the normality of the residuals, and the homogeneity of variance. Lastly, we will check for multicollinearity among the predictors using the Variance Inflation Factor (VIF), a statistical measure used to detect multicollinearity in a regression model. Multicollinearity occurs when two or more independent variables in a model are highly correlated, meaning they contain redundant information. High multicollinearity can distort the estimates of coefficients, making it difficult to interpret the individual effect of each predictor.
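
For reference, the VIF of predictor j is 1 / (1 − R²ⱼ), where R²ⱼ is obtained by regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated predictors; the variable names `x1`, `x2`, and `x3` are hypothetical.

```r
set.seed(7)

# Simulated predictors; x2 is deliberately correlated with x1.
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

In this simulation, `x1` and `x2` should show clearly inflated values while `x3` stays close to 1. Thresholds of 5 or 10 are often used as rules of thumb for problematic multicollinearity, although the appropriate cut-off depends on the context.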

We also want to build a predictive model. The EDA and the regression model will likely show that some of the key factors in our dataset are useful for predicting a person's weight category. Once we have identified relationships within our data, we aim to make reliable predictions about future outcomes. The regression will also help us understand which variables have the most significant impact on obesity level. Using additional metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R², we will assess how well the model performs and refine it as needed to improve accuracy.
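
The three metrics can be written as short helper functions. This is a minimal sketch with hypothetical function names and made-up values; it assumes the outcome is treated as numeric.

```r
# Mean Absolute Error: average size of the prediction errors.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Root Mean Square Error: penalizes large errors more strongly than MAE.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# R-squared: share of the variance in `actual` explained by the predictions.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up values: four perfect predictions and one that is off by 2.
actual    <- c(1, 2, 3, 4, 5)
predicted <- c(1, 2, 3, 4, 7)

print(c(MAE  = mae(actual, predicted),
        RMSE = rmse(actual, predicted),
        R2   = r_squared(actual, predicted)))
```

Here the single large error gives an MAE of 0.4 but an RMSE of about 0.89, which shows why the two metrics are worth reporting together.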

3.0.2 5. Conclusion

So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.

The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model—one that includes extreme values and one that excludes them—to evaluate the impact of outliers on model accuracy and stability. Key metrics such as R², P-values, and VIF will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.

These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.
